Abstract

The  process  of  defining  physical  fitness  test  batteries  typically  relies  on  qualitative  evaluations  of 
individual  tests.  Starting  from  the  existing  consensus  regarding  the  mapping  of  physical  fitness 
tests  onto  physical  ability  constructs,  analyses  were  carried  out  to  develop  quantitative  test 
validity  indices  for  use  in  test  battery  design.  The  validity  indices  were  averaged  factor  loadings 
from  confirmatory  factor  analysis  (CFA)  of  the  inter-test  correlation  matrices  from  85 
independent  samples.  The  CFA  included  latent  traits  representing  muscular  strength,  muscular 
power,  muscular  endurance,  and  cardiorespiratory  endurance.  The  averaged  factor  loadings  came 
from  random  effects  analysis  of  the  factor  loadings  from  the  85  measurement  models.  The  results 
confirmed  the  accepted  assignment  of  fitness  tests  to  categories  representing  the  four  physical 
ability  constructs.  The  average  factor  loading  varied  from  test  to  test  within  each  category,  but 
the  inter-test  variation  generally  was  small  relative  to  the  standard  errors  of  the  individual 
loading  estimates.  The  modest  validity  differences  leave  considerable  freedom  to  use  additional 
criteria,  such  as  ease  of  administration,  time  requirements,  and  face  validity  from  the  perspective 
of  the  test  population,  when  designing  physical  fitness  test  batteries. 

Construct Validity of Physical Fitness Tests

Physical  fitness  tests  measure  physical  abilities.  Fitness  test  batteries  frequently  are  used 
to  assess  the  ability  to  meet  occupational  performance  requirements.  Valid  tests  must  be  selected 
for  a  battery  to  derive  valid  inferences  about  performance  potential.  Occupational  requirements 
and  time  and  equipment  requirements  influence  test  battery  designs  for  specific  applications. 
Expert  judgment  is  the  primary  basis  for  choosing  tests.  The  experts  select  tests  that  are  believed 
to  measure  relevant  physical  abilities  and  that  can  be  administered  within  the  time  and  equipment 
constraints  for  testing.  Expert  judgment  is  largely  qualitative.  For  example,  judgments  must  be 
made  regarding  what  can  be  measured.  Judgment  is  needed  because  factor  analyses  have 
identified  between  3  (Hogan,  1991)  and  14  (Nicks  &  Fleishman,  1962)  physical  abilities  or 
physical  proficiency  factors  that  can  be  measured  by  physical  fitness  tests.  Judgments  must  be 
made  about  how  many  factors  there  really  are  and,  of  those,  which  ones  are  relevant  to  the 
current  testing  objectives.  The  factor  analytic  research  also  classifies  fitness  tests  into  groups 
representing  different  physical  abilities.  For  example,  push-ups  and  pull-ups  are  accepted 
muscular  endurance  (ME)  measures,  while  a  distance  run  is  an  accepted  measure  of 
cardiovascular  endurance  (CE).  When  the  decision  to  measure  a  given  ability  is  made,  additional 
decisions  are  needed  to  decide  which  test  or  tests  to  use  for  making  the  desired  measurements. 
The  additional  decisions  are  needed  because  the  current  state  of  the  art  provides  little  guidance 
for  choosing  among  the  tests  that  measure  the  particular  abilities  of  interest. 

This  paper  presents  a  reanalysis  of  a  large  volume  of  evidence  relating  physical  fitness 
tests  to  physical  abilities.  The  results  provide  a  catalogue  of  quantitative  construct  validity 
coefficients  for  individual  physical  fitness  tests.  The  catalogue  entries  rank  individual  physical 
fitness tests from most to least valid as indicators of the associated ability construct. The validity
information can supplement expert judgment when designing fitness test batteries.

The  catalogue  covers  four  major  physical  ability  constructs.  Hogan’s  (1985)  conceptual 
framework  was  the  starting  point  for  this  effort.  Hogan’s  framework  consisted  of  seven  physical 
ability  constructs  that 

. .  .provide  comprehensive  coverage  of  the  physical  performance  domain;  the 
dimensions  meet  the  following  four  criteria:  (a)  recognized  research  history;  (b) 
definition  consistent  with  human  physiology;  (c)  measurement  yielding  variability 
across individuals; (d) association with performance in a variety of activities and
tasks.  (Hogan,  1985,  p.  220) 

This  paper  focuses  on  four  of  Hogan’s  (1985)  constructs:  muscular  strength  (MS), 
muscular  power  (MP),  ME,  and  CE.  These  physical  abilities  were  the  focus  because  they  appear 
most frequently in the job performance literature. Based on past practice, these constructs are
likely  to  be  of  interest  in  designing  occupational  fitness  test  batteries. 

Hogan’s  (1985)  model  derived  from  an  extensive  history  of  factor  analyses  of  physical 
fitness  tests.  In  this  model,  MS  is  “(t)he  capacity  to  exert  force  as  a  result  of  tension  produced  in 
muscles.”  MP  is  “(t)he  capacity  to  exert  force  to  move  a  mass  a  given  distance  during  a 
measured  time.”  ME  is  “(t)he  capacity  of  muscles  to  continue  work  over  time  while  resisting 
fatigue.”  CE  is  “(t)he  capacity  of  the  heart  and  related  body  systems  to  sustain  prolonged 
muscular  activity.”  These  four  constructs  are  accepted  in  the  physical  fitness  literature  as  factors 
that  have  been  replicated  across  studies.  Tests  that  are  indicators  of  each  construct  have  been 
identified  in  the  replication  process. 

This study focuses on the problem of selecting specific tests for a physical fitness test
battery.  This  study  provides  analyses  that  make  it  possible  to  choose  tests  based  on  their 
construct-related  validity. 

Method 

Literature  Search 

The  validity  coefficients  derived  for  physical  fitness  tests  are  based  on  the  relationships 
among  those  tests.  A  literature  search  identified  papers  that  provided  correlation  matrices 
describing  those  relationships.  Previous  reviews  by  Nicks  and  Fleishman  (1962)  and  Fleishman 
(1964)  were  the  search  starting  points.  Computer  searches  coupled  the  four  ability  constructs 
with  the  keyword  “measurement.”  Variants  of  the  construct  names  were  employed.  An  ancestry 
search  of  papers  that  met  the  initial  inclusion  criteria  identified  additional  relevant  references. 

The  review  was  limited  to  studies  that  met  minimum  data  requirements.  The  primary 
criterion  was  that  the  study  had  to  report  a  correlation  matrix  that  included  at  least  three  tests  for 
a  single  physical  ability.  For  example,  a  study  was  included  if  it  reported  all  of  the  correlations 
between  three  or  more  ME  tests.  The  criterion  was  relaxed  slightly  for  studies  that  investigated 
two  or  more  physical  abilities.  Those  studies  were  included  if  the  correlation  matrix  covered  two 
or  more  tests  for  each  of  two  or  more  abilities.  These  inclusion  criteria  ensured  that  the  data 
would produce statistically acceptable measurement models. Sixty-eight studies that met the
inclusion  criteria  reported  results  for  85  samples. 
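
The two inclusion rules can be written as a small check. The sketch below is illustrative only: the function name and the dictionary input are hypothetical, and it assumes that each count refers to tests of one ability whose intercorrelations are all reported.

```python
def meets_inclusion_criteria(tests_per_ability):
    """Return True if a reported correlation matrix satisfies the data requirements.

    tests_per_ability maps an ability label (e.g., "ME") to the number of tests
    of that ability in the matrix. The labels and this input format are
    hypothetical conveniences, not part of the original review protocol.
    """
    counts = [n for n in tests_per_ability.values() if n > 0]
    # Primary criterion: at least three tests for a single physical ability.
    single_ability_ok = any(n >= 3 for n in counts)
    # Relaxed criterion: two or more tests for each of two or more abilities.
    multi_ability_ok = sum(1 for n in counts if n >= 2) >= 2
    return single_ability_ok or multi_ability_ok
```

Under these rules a matrix with three ME tests qualifies, as does one with two ME tests and two CE tests, whereas a matrix with two ME tests and one CE test does not.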

Demographics 

The  typical  study  participant  was  a  man  (Table  1).  Roughly  50%  of  the  samples  and  50% 
of  the  total  number  of  participants  were  men.  Men  contributed  47%  of  all  test  scores.  The  true 
contribution  of  men  to  the  overall  data  probably  is  much  larger.  These  figures  excluded  Blakly, 
Quinones, Crawford, and Jago’s (1994) study of 13,000 participants who provided 52,000 test
scores. 

The  age-sex  composition  of  the  samples  was  noteworthy.  The  test  data  from  male  and 
female  subjects  were  combined  in  the  analyses  for  14  of  17  samples  of  children  and  adolescents 
compared  with  3  of  62  samples  of  adults. 

Construct  Sampling 

The  number  of  constructs  represented  in  the  measurement  models  varied  from  sample  to 
sample  (Table  2).  Isometric  MS,  ME,  MP,  and  CE  measures  were  administered  to  between  36% 
and  52%  of  all  samples.  Isotonic  MS  and  dynamic  MS  measures  were  administered  to  8%  and 
13%  of  the  samples,  respectively. 

The  measurement  models  included  as  many  as  six  latent  traits  even  though  the  literature 
search  focused  on  four  physical  ability  constructs.  The  additional  latent  traits  were  added 
because  different  MS  measurement  methods  produced  distinct  latent  traits  (Appendix  A). 
Preliminary  analyses  demonstrated  that  although  all  strength  tests  measured  the  same  general 
construct,  the  specific  measurement  method  affected  the  representation  of  that  construct.  For 
example,  the  strength  construct  defined  by  a  set  of  isometric  strength  tests  was  highly  correlated 
with,  but  not  identical  to,  the  strength  factor  defined  by  a  set  of  isoinertial  strength  tests.  Neither 
of  those  factors  was  identical  to  the  strength  factor  defined  by  a  set  of  dynamic  strength  tests. 
This  methodological  variation  meant  that  strength  tests  defined  as  many  as  three  latent  traits  in 
some  studies. 

Adding  three  latent  traits  for  MS  to  the  latent  traits  for  ME,  MP,  and  CE  meant  that  the 
ability  measurement  models  could  include  as  many  as  six  latent  traits.  Few  models  were  this 
complex.  The  measurement  model  represented  just  one  construct  in  36  samples.  The  model 
represented two constructs in 29 samples. The model represented three constructs in 13 samples.
Four  constructs  were  represented  in  the  models  for  five  samples.  The  model  represented  six 
constructs  in  only  two  samples. 

Analysis  Procedures 

Model  construction  began  by  assigning  tests  to  physical  ability  categories.  Each  test  was 
assigned  to  a  single  category  based  on  its  usual  interpretation  in  the  testing  literature.  These 
assignments  were  straightforward  except  in  the  case  of  run  tests.  Shorter  runs  generally  are 
classified  as  MP  tests;  longer  runs  are  classified  as  CE  tests.  In  the  present  case,  run  tests  that 
covered 600 yd or less were classified as MP indicators; run tests that covered 880 yd or more
were classified as CE indicators. This cutoff was based on Disch, Frankiewicz, and Jackson’s
(1975) factor analysis of performance on run tests of 50 yd, 100 yd, 0.50 mi, 0.75 mi, 1.00 mi,
1.25  mi,  1.50  mi,  1.75  mi,  2.00  mi,  and  12  min.  The  factor  analysis  produced  two  factors,  one 
defined  primarily  by  the  shortest  runs  and  one  defined  primarily  by  the  longest  runs. 
Intermediate  runs  of  0.50  mi  to  1.0  mi  had  much  larger  loadings  on  the  factor  defined  by  long 
runs  than  on  the  factor  defined  by  short  runs. 
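
The distance rule for run tests can be sketched as a simple classifier. The function name is hypothetical, and the code assumes that the longer runs are treated as CE indicators, in line with the general rule that longer runs are CE tests. Distances between the two cutoffs are left unclassified because the text does not assign them, and timed runs such as the 12-min run are outside the scope of a distance rule.

```python
YARDS_PER_MILE = 1760


def classify_run(distance_yd):
    """Assign a distance run to an ability category using the stated cutoffs.

    Returns "MP" for runs of 600 yd or less, "CE" for runs of 880 yd or more,
    and None for distances between the cutoffs, which the rule leaves open.
    """
    if distance_yd <= 600:
        return "MP"
    if distance_yd >= 880:
        return "CE"
    return None
```

For example, `classify_run(50)` gives "MP" and `classify_run(1.5 * YARDS_PER_MILE)` gives "CE".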

The  confirmatory  factor  analysis  (CFA)  model  was  unidimensional  when  all  tests 
administered  to  a  sample  represented  a  single  construct.  The  test  battery  had  to  include  at  least 
three  tests,  the  minimum  number  required  to  identify  a  latent  trait.  Multidimensional  CFA 
models  were  constructed  when  the  correlation  matrix  included  two  or  more  tests  for  each  of  two 
or  more  ability  constructs. 

The  measurement  models  combined  free  and  constrained  factor  loadings  for  each 
physical  ability  test.  A  loading  for  each  test  on  its  hypothesized  factor  was  freely  estimated.  The 
factor  loadings  for  each  test  on  all  other  factors  were  fixed  at  zero. 




The  remaining  CFA  model  elements  defined  the  latent  trait  structure  for  the  models.  All 
latent  traits  were  scaled  by  fixing  their  variances  at  1.00.  This  scaling  choice  made  it  possible  to 
estimate  factor  loadings  for  all  tests.  Also,  the  latent  trait  correlations  were  freely  estimated  in 
the  multidimensional  CFA  models.  The  analyses  were  conducted  using  LISREL  8  (Joreskog, 
Sorbom,  du  Toit,  &  du  Toit,  2000). 
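
The model specification can be illustrated with a short sketch. It is not the LISREL setup used in the analyses; the function names are hypothetical. It shows the pattern of free and fixed loadings and the correlation that such a model implies for a pair of tests: with latent variances fixed at 1.00 and each test loading on one factor, the implied correlation is the product of the two loadings and the correlation between their factors.

```python
def loading_pattern(assignments, factors):
    """Build the free/fixed loading pattern for a simple-structure CFA.

    assignments is a list of (test_name, factor) pairs. Each row of the result
    holds True where the loading is freely estimated (the test's hypothesized
    factor) and False where it is fixed at zero.
    """
    return [[factor == f for f in factors] for _, factor in assignments]


def implied_correlation(lam_i, factor_i, lam_j, factor_j, factor_corr):
    """Model-implied correlation between two different tests.

    Latent variances are fixed at 1.00, so two tests on the same factor
    correlate lam_i * lam_j. Tests on different factors are attenuated by
    the latent trait correlation stored in factor_corr, a dict keyed by
    frozenset({factor_a, factor_b}).
    """
    if factor_i == factor_j:
        phi = 1.0
    else:
        phi = factor_corr[frozenset((factor_i, factor_j))]
    return lam_i * phi * lam_j


# Illustration: two ME tests and one MP test; the ME-MP latent correlation of
# .50 is a hypothetical value chosen only for the example.
pattern = loading_pattern(
    [("push-ups", "ME"), ("pull-ups", "ME"), ("50-yd dash", "MP")],
    ["ME", "MP"],
)
example_phi = {frozenset(("ME", "MP")): 0.50}
```

Because the cross-factor correlations carry the relationships between tests of different abilities, no secondary loadings are needed to reproduce them.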

Random effects (RE) meta-analyses estimated the average latent trait loadings (λAvg) for
each  test  on  its  hypothesized  ability  dimension.  The  meta-analytic  computations  weighted 
individual  latent  trait  loadings  by  the  inverse  of  their  variance.  The  variance  was  the  squared 
standard  error  for  the  loading  in  the  CFA  model.  This  weighting  scheme  was  adopted  after 
preliminary  analyses  demonstrated  that  the  CFA  models  produced  appropriate  standard  errors 
even  though  correlation  matrices  were  being  analyzed  (Appendix  B). 

An  RE  model  was  adopted  to  obtain  results  that  could  be  generalized  beyond  the  studies 
in  the  analysis  (see  Borenstein,  Hedges,  Higgins,  &  Rothstein,  2009).  An  RE  model  was 
appropriate  because  differences  in  participant  characteristics  (e.g.,  age,  sex,  general  fitness),  test 
setting  (e.g.,  academic  vs.  military),  and  procedural  differences  in  test  administration  (e.g.,  1-min 
push-ups  vs.  2-min  push-ups)  made  it  unlikely  a  priori  that  the  factor  loadings  would  be  invariant 
across studies. Furthermore, RE analyses reduce to fixed effect models when there is little or no
empirical variation in the parameter estimates. The analyses were conducted with an SPSS-PC
(version 17) syntax program written to implement the procedures in Borenstein et al. (2009).
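
A minimal sketch of the random effects computation follows. It is not the SPSS syntax used for the analyses. It assumes the method-of-moments (DerSimonian-Laird) estimate of the between-study variance, which is the approach presented in Borenstein et al. (2009); the function and variable names are hypothetical.

```python
import math


def random_effects_mean(loadings, std_errors):
    """Inverse-variance random effects average of factor loadings.

    loadings and std_errors are the per-sample CFA estimates for one test.
    Returns the RE mean, its standard error, the heterogeneity statistic Q,
    and the between-study variance tau2 (method-of-moments estimate).
    """
    variances = [se ** 2 for se in std_errors]
    w = [1.0 / v for v in variances]          # fixed effect weights
    sum_w = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, loadings)) / sum_w
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, loadings))
    df = len(loadings) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # tau2 is truncated at zero; with tau2 = 0 the RE and fixed results coincide.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]
    sum_w_re = sum(w_re)
    mean = sum(wi * yi for wi, yi in zip(w_re, loadings)) / sum_w_re
    se = math.sqrt(1.0 / sum_w_re)
    return {"mean": mean, "se": se, "q": q, "tau2": tau2}
```

When the loadings do not vary across samples, Q falls below its degrees of freedom, tau2 is set to zero, and the weights are the fixed effect weights, which is the sense in which the RE analysis yields a fixed effect model.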

Results 


Muscular  Strength 

A test was acceptable if λAvg was significantly, p < .05, greater than .40. This
acceptability criterion was more stringent than the λ = 0.40 rule of thumb commonly used to
identify acceptable latent trait indicators in exploratory factor analyses.
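
The acceptability criterion can be sketched as a one-tailed test of the average loading against .40. The normal-theory z test is an assumption here, although it agrees with the z values reported in Table 3; for the arm dynamometer, (.723 - .40)/.128 is approximately 2.52. Function names are hypothetical.

```python
import math


def acceptability_test(avg_loading, se, threshold=0.40):
    """One-tailed z test that an average loading exceeds the threshold.

    Returns (z, p), where p is the upper-tail probability of a standard
    normal variate. The normal approximation is an assumption of this sketch.
    """
    z = (avg_loading - threshold) / se
    p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_one_tailed


def is_acceptable(avg_loading, se, threshold=0.40, alpha=0.05):
    """Apply the criterion: loading significantly greater than the threshold."""
    _, p = acceptability_test(avg_loading, se, threshold)
    return p < alpha
```

A test with an average loading above .40 can still fail this criterion when its standard error is large, which is why the rule is stricter than the simple .40 cutoff.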

Isometric strength. Isometric strength tests measure the maximum force that a muscle can
generate in a contraction in which the muscle develops force but does not shorten (Powers &
Howley, 1990). Twenty-three of 24 isometric strength tests were acceptable; neck flexion was
the exception (Table 3). The best indicators were low lift, λAvg = .884; shoulder extension, λAvg =
.828; and torso/upper body flexion, λAvg = .816.

Isotonic  strength.  Isotonic  strength  tests  involve  contractions  in  which  the  muscle 
shortens  against  a  fixed  resistance.  The  shortening  results  in  movement  (Powers  &  Howley, 
1990). All six tests were acceptable (Table 4). The best options were bench press, λAvg = .856,
and shoulder press, λAvg = .851.

Dynamic  strength.  Dynamic  strength  tests  required  coordinated  lifting  actions  involving 
multiple  muscle  groups.  These  tests  were  akin  to  Olympic  weight  lifts.  The  dynamic  strength 
tests  in  this  review  were  performed  with  an  incremental  lift  machine.  Stevenson,  Bryant, 
Greenhorn,  Deakin,  and  Smith  (1995)  described  the  lift  dynamics.  Both  lift  tests  were  acceptable 
(Table 5). The λAvg difference was too small to designate either test as the better option.

General  strength.  The  isometric,  isotonic,  and  dynamic  strength  latent  traits  were  highly 
correlated,  .758  <  r  <  .848  (see  Table  6).  Exploratory  factor  analysis  of  the  latent  trait 
correlations  produced  a  unidimensional  model  with  the  following  factor  loadings:  Isoinertial 
Strength, λ = .954; Isometric Strength, λ = .854; and Dynamic Strength, λ = .889.


Muscular  Endurance 

Seven of nine ME tests were acceptable (Table 7); leg lifts and half-hold sit-ups were the
exceptions. Dips, λAvg = .761; push-ups, λAvg = .753; pull-ups, λAvg = .720; and bent-arm hang,
λAvg = .699, were notably superior to the other ME tests, including sit-ups, λAvg = .498.

Muscular  Power 

All 12 MP tests were acceptable (Table 8). The best MP indicators were the 100-yd dash,
λAvg = .812, and the 300-yd run, λAvg = .786, but those tests have been infrequently studied. If
attention were limited to those tests that have been studied frequently (k > 10), the best tests were
the 50-yd dash, λAvg = .764; the long jump, λAvg = .734; the vertical jump, λAvg = .672; and the
medicine ball throw/shot put, λAvg = .699. The λAvg for each of these frequently studied tests fell
within the 95% confidence intervals for the 100-yd dash and the 300-yd run, so all six tests were
statistically equivalent. The λAvg values for ergometer tests were substantially lower than those for
dashes and jumps: arm ergometer, λAvg = .559; leg ergometer, λAvg = .609.

Cardiovascular  Endurance 

Eight of nine CE tests were acceptable; the step test was the exception (Table 9).
Overlapping confidence intervals for the distance runs made it impossible to designate any best
choice(s). The average factor loading for VO2max, λAvg = .707, was notably weaker than that for
any run test.

Latent  Trait  Correlations 

Correlations  between  the  physical  ability  latent  traits  were  moderate,  .448  <  r  <  .687, 
except  for  a  near-zero  correlation  of  MS  with  CE  (r  =  .088,  see  Table  10). 

CFA  Constraints 

The  CFA  models  estimated  a  factor  loading  for  each  test  on  a  single  factor.  The  models 
could  have  included  a  factor  loading  for  each  test  on  all  four  factors.  However,  the  models  fixed 
three possible factor loadings for each test at zero. These constraints on secondary factor
loadings might have been inappropriate. Performance on some tests might be influenced by two
or more physical abilities. The CFA provided information that was used to evaluate
constraint  appropriateness.  In  particular,  the  output  included  estimates  of  what  the  secondary 
factor  loadings  would  have  been  if  they  had  been  freely  estimated. 

Constraint  appropriateness  was  evaluated  by  examining  the  77  secondary  factor  loadings 
that  had  been  estimated  in  three  or  more  analyses.  The  average  estimated  secondary  loading  was 
-.020 across all 77 evaluated pairs. Only 4 of 77 pairs produced |λEst| > .40. A single outlier
value  accounted  for  the  large  average  loading  in  each  of  those  four  cases. 

Discussion 

Standard  practices  correctly  classify  fitness  tests  in  relation  to  general  physical  abilities. 
Using  the  standard  classifications  as  the  basis  for  CFA  models,  the  fitness  tests  were  acceptable 
ability  indicators  in  58  of  61  cases.  The  CFA  models  also  provided  enough  information  to 
evaluate  the  appropriateness  of  fixing  secondary  loadings  at  zero.  The  expected  value  of  those 
constrained  loadings  was  virtually  zero.  The  expectations  were  not  large  enough  to  justify  adding 
any  secondary  factor  loadings  given  the  risk  of  introducing  post  hoc  model  modifications  based 
on  chance  findings  (MacCallum,  Roznowski,  &  Necowitz,  1992).  Thus,  58  of  61  tests  were 
acceptable  indicators  of  their  specified  ability  construct  and  were  not  related  to  any  of  the  other 
ability  constructs. 

The  physical  ability  constructs  were  correlated.  The  typical  inter-trait  correlation  was 
moderately  large.  A  near-zero  correlation  of  MS  with  CE  was  the  exception  to  this  general  trend. 
The  latent  trait  correlations  establish  the  potential  for  omitted  variable  bias  (James,  Mulaik,  & 
Brett, 1982). Earlier work demonstrated that bias can occur in causal models relating physical
ability to physical task performance (Vickers, Hodgdon, & Beckett, 2009).

Having  latent  trait  correlations  in  the  measurement  model  was  unusual.  Past  work  has 
relied  on  orthogonal  factor  models.  A  model  with  correlated  dimensions  is  consistent  with  the 
subjective  impression  that  people  differ  in  general  fitness.  A  model  with  correlated  dimensions 
also  is  more  parsimonious.  In  the  present  case,  six  correlations  between  four  latent  traits  have 
been substituted for the 183 secondary factor loadings that would be required for a four-dimensional
orthogonal model for 61 ability tests. Model parsimony and plausibility both favor a
correlated  abilities  model  over  an  orthogonal  abilities  model. 
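
The parameter counts behind the parsimony comparison follow from simple arithmetic, shown here as a worked check. Each of the 61 tests has one primary loading and three possible secondary loadings in a four-factor model, and four latent traits have six distinct pairwise correlations.

```latex
\[
\underbrace{61 \times (4 - 1)}_{\text{secondary loadings}} = 183
\qquad \text{versus} \qquad
\underbrace{\binom{4}{2} = \frac{4 \times 3}{2}}_{\text{latent trait correlations}} = 6 .
\]
```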

The  ability  constructs  represent  performance  capacities  or  physical  proficiencies.  These 
constructs  should  not  be  equated  with  specific  physiological  processes.  The  relatively  modest 
factor  loadings  for  laboratory  tests  of  anaerobic  and  aerobic  capacities  support  this  view.  If  the 
CFA  measurement  models  had  been  recast  as  causal  models,  the  laboratory  tests  would  have 
defined physiological constructs that caused performance differences. Had this been done, the
physiological latent trait processes typically would have been identical to the laboratory test. The
identity would occur because most studies included only one laboratory test for the relevant
physiological capacities. Given the laboratory test-physiological process identity, the estimated
causal effects of the physiological processes on performance would have been identical to the
laboratory test factor loadings in the CFA measurement models. This factor loading
interpretation would mean that anaerobic power accounts for 30% of the MP variance if the arm
ergometer test is chosen, and 36% of the MP variance if the leg ergometer test is chosen.
Laboratory-based aerobic capacity measures account for 50% of the CE performance variance.

The factor loadings have a simple practical interpretation. The factor loadings are the
correlation  of  test  scores  with  the  latent  traits.  This  correlation  can  be  transformed  to  answer  the 
question  “How  accurately  will  test  scores  identify  individuals  with  above  average  ability?”  A 
simple  classification  rule  would  predict  that  an  individual  with  an  above  average  test  score  was 
above  average  on  ability.  Using  this  rule,  Rosenthal  and  Rubin’s  (1982)  binomial  effect  size 
display  (BESD)  converts  correlations  into  the  percentage  of  individuals  in  a  sample  who  will  be 
correctly classified using the stated rule. In the current context, BESD = 50 + (50 × λAvg). The
median BESD was 88% (range = 67%-99%). This figure is substantially higher than the 50%
accuracy that would be expected from guessing if no test were given. Thus, another interpretation
is that the λAvg value for a test is the proportional reduction in error (PRE) associated with using
that test instead of guessing (Hildebrand, Laing, & Rosenthal, 1977).
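
The BESD conversion and its PRE reading can be checked with a few lines of code. The function names are hypothetical; the formulas are the ones stated in the text. Guessing yields 50% errors and a test yields 100 - BESD percent errors, so the proportional reduction in error equals the loading itself.

```python
def besd_percent(avg_loading):
    """Percentage correctly classified under the BESD: 50 + 50 * loading."""
    return 50.0 + 50.0 * avg_loading


def loading_from_besd(besd):
    """Invert the BESD to recover the loading that produces it."""
    return (besd - 50.0) / 50.0


def proportional_reduction_in_error(avg_loading):
    """Error reduction from using the test instead of guessing."""
    error_guessing = 50.0
    error_with_test = 100.0 - besd_percent(avg_loading)
    return (error_guessing - error_with_test) / error_guessing
```

For the low lift, λAvg = .884 corresponds to a BESD of about 94%, and the median BESD of 88% corresponds to a loading of .76.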

BESD  and  PRE  provide  a  frame  of  reference  for  choosing  between  tests.  Suppose  Test  A 
requires  less  time  and  equipment  to  administer  than  Test  B.  If  Test  B  is  less  accurate,  Test  A  is 
the  clear  choice.  If  Test  B  is  more  accurate  than  Test  A,  the  choice  becomes  more  complex.  Test 
accuracy  must  be  weighed  against  administrative  simplicity.  The  accuracy  difference  will  be  too 
small  to  be  important  in  many  comparisons. 

Even apparently large accuracy differences must be treated with caution. Some λAvg
estimates are based on data from a few small samples. Those estimates will tend to have wide
95% confidence intervals. A conservative treatment would consider this fact when comparing
tests. Suppose λAvg for Test A is greater than λAvg for Test B. Tests A and B still could be
regarded as equivalent if the 95% confidence interval for Test A included the Test B λAvg
estimate. If the 95% confidence interval for Test A is broad, it can be appropriate to consider
tests as equivalent even though the difference in their λAvg values is large. This issue is most
likely  to  arise  when  considering  tests  that  have  been  used  infrequently  in  the  past.  Those  tests  are 
the  ones  that  are  likely  to  have  wide  confidence  intervals.  Additional  study  of  promising 
alternatives  that  have  been  infrequently  used  in  past  research  could  be  useful. 
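
The conservative comparison rule can be sketched as follows. The function name is hypothetical, and the example values are the confidence bounds and loadings reported in Table 3.

```python
def practically_equivalent(ci_lower_a, ci_upper_a, loading_b):
    """Treat Test B as equivalent to Test A if B's average loading lies
    inside the confidence interval reported for Test A."""
    return ci_lower_a <= loading_b <= ci_upper_a


# Back dynamometer (bounds .563 to 1.012) versus hip flexion (loading .782):
wide_interval_case = practically_equivalent(0.563, 1.012, 0.782)
# Low lift (bounds .826 to .942) versus arm lift (loading .692):
narrow_interval_case = practically_equivalent(0.826, 0.942, 0.692)
```

The first comparison treats the two tests as equivalent because the back dynamometer interval is wide; the second does not, because the low lift interval is narrow and excludes the arm lift estimate.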

The  results  provide  some  general  guidelines  for  test  battery  design.  On  the  whole,  the 
evidence  supports  the  common  practice  of  focusing  on  administrative  simplicity.  Usually, 
several tests have λAvg values that make them equivalent ability indicators for practical purposes.
Administrative  simplicity  is  a  reasonable  basis  for  choosing  among  those  tests. 

Test  battery  measurement  precision  can  be  increased  by  including  multiple  indicators  for 
MS, MP, and ME, when possible. The tests in these three domains have moderate λAvg values.
However, the tests load on the same factor because they are correlated. The sum of the
standardized scores for two or more tests will estimate true ability more accurately than any
single test (Nunnally & Bernstein, 1994). Choosing the tests with the highest λAvg values will
maximize accuracy. Note that using multiple CE tests will have little value because the large λAvg
values for run tests leave little room for improving the precision of a single test.
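
The gain from combining tests can be illustrated under a simplifying assumption: a single-factor model in which the unique parts of the tests are uncorrelated, so that two tests correlate as the product of their loadings. The covariance of a sum of standardized scores with the latent trait is then the sum of the loadings, and the variance of the sum is the number of tests plus twice the sum of the pairwise loading products. The function name is hypothetical.

```python
import math


def composite_validity(loadings):
    """Correlation of a unit-weighted sum of standardized tests with the
    latent trait, assuming one common factor and uncorrelated unique parts."""
    k = len(loadings)
    total = sum(loadings)
    # Twice the sum of pairwise products equals total^2 minus the sum of squares.
    twice_pairwise = total ** 2 - sum(lam ** 2 for lam in loadings)
    return total / math.sqrt(k + twice_pairwise)


# Dips and push-ups, using the average loadings reported for the ME tests.
me_pair = composite_validity([0.761, 0.753])
```

For dips and push-ups the composite correlates about .85 with the ME latent trait, compared with .761 and .753 for the individual tests.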

The  limitations  of  this  work  must  be  considered  to  evaluate  the  results  properly.  The 
available  evidence  is  skewed  toward  school-aged  children  and  collegians  in  physical  education 
classes.  Tests  have  not  been  randomly  paired  within  or  across  the  ability  domains.  The  analyses 
treat  test  administration  differences  (e.g.,  push-ups  in  1  min  or  2  min)  as  random  variance 
sources.  Constructs  may  not  have  been  correctly  interpreted.  Lower  body  tests  defined  MP,  and 
upper  body  tests  defined  ME.  Perhaps  both  traits  are  narrow  expressions  of  a  general  capacity  for 
repetitive  submaximal  exertions.  Some  relevant  data  had  to  be  omitted.  Marsh  (1993)  produced 
a  structural  equation  model  that  demonstrated  invariance  of  physical  ability  latent  traits  across 
sex and age groups in a large sample. The assignment of tests to latent traits in that study was too
different from the assignments in this analysis to equate its results with the present
findings. Finally, the lack of simple search terms to identify studies reporting inter-test
correlations  makes  it  almost  certain  that  the  literature  search  has  overlooked  some  useful 
correlation  matrices. 

The  ideal  outcome  would  have  been  the  identification  of  the  best  possible  physical  ability 
test  battery.  Instead,  the  evidence  indicates  that  a  number  of  equivalent  test  batteries  can  be 
constructed  by  defining  sets  of  practically  equivalent  tests  as  the  best  choices  within  each 
physical  ability  domain.  Equivalent  test  batteries  then  can  be  constructed  by  selecting  one  or 
more  from  each  of  the  four  “best  test”  sets.  The  failure  to  define  the  best  possible  test  battery 
might  be  regarded  as  an  outcome  limitation,  but  guidance  on  how  to  construct  equivalent  test 
batteries  may  be  a  more  useful  outcome.  This  outcome  provides  the  practitioner  with  flexibility 
in  battery  design  coupled  with  confidence  that  his  or  her  battery  is  optimum  or  close  enough  for 
practical  applications. 

Despite  limitations,  this  review  has  produced  several  useful  findings.  The  common 
treatment  of  MS,  MP,  ME,  and  CE  as  distinct  ability  constructs  was  supported.  The  results  went 
beyond  this  simple  affirmation  by  showing  that  the  ability  constructs  are  correlated  in  the  general 
population.  The  analyses  sharpened  the  interpretation  of  the  ability  constructs  by  highlighting  the 
fact  that  these  constructs  represent  performance  capacities  that  should  not  be  equated  with 
specific  physiological  processes.  Estimates  of  the  effects  of  anaerobic  and  aerobic  capacities  on 
MP  and  CE  performance  were  obtained  as  incidental  modeling  outcomes.  The  evidence 
supported  the  standard  mapping  of  tests  onto  ability  constructs  and  showed  that  tests  are  specific 
to  a  particular  physical  ability  once  the  correlations  between  abilities  are  recognized.  Test  battery 
design has been facilitated by providing a set of λAvg values suitable for designing efficient,
reliable  fitness  test  batteries. 

Table 1

Sample Descriptions

                       k         N    No. of Test Scores
Adults
  Men                 41     6,680                57,566
  Women               18     3,200                26,784
  Men and women        3       468                17,699
Children
  Boys                 1        20                    60
  Girls                2       118                   854
  Boys and girls      14     2,042                14,617
Totals
  Male                42     6,700                57,626
  Female              20     3,318                27,638
  Adult               62    10,348               102,049
  Child               17     2,180                15,331
  No information       5       743                 5,724
  Grand total         84    13,271               123,104

Note.  Cumulative  values  for  age  and  gender  groups  do  not  equal  the  total  because  they  do  not  include  those  samples 
for  which  no  demographic  information  was  available.  The  table  omits  Blakly  et  al.’s  (1994)  sample  of  N  =  13,000 
men  and  women  who  contributed  52,000  test  scores  so  that  one  exceptional  sample  did  not  inflate  the  totals. 

Table 2

Data Distribution by Ability Construct

                        k         N    No. of Test Scores
Isometric strength     44     6,808                30,440
Isotonic strength       7     1,455                 3,068
Dynamic strength        9     1,315                 6,455
Muscular endurance     36    10,112                32,024
Muscular power         37     7,390                28,680
Cardio endurance       30     2,747                 7,136
Total                  84    13,271               107,803

Note. The tabled values omit Blakly et al.’s (1994) sample of N = 13,000 men and women who contributed 52,000
test scores so that one exceptional sample did not inflate the totals.
Table 3

Isometric Strength Test Results

                                                95% CI Bounds
Test                        k    λAvg      SE   Lower   Upper       Q     Sig       z     Sig
Low lift                    9    .884    .031    .826    .942    6.49    .592   15.49    .000
Shoulder extension          3    .828    .033    .731    .924    1.53    .465   12.94    .000
Torso/upper body flexion    4    .816    .052    .692    .939    3.09    .377    7.93    .000
Back dynamometer            4    .788    .095    .563   1.012    2.42    .490    4.06    .000
Hip flexion                 5    .782    .046    .684    .881    2.27    .685    8.26    .000
Shoulder flexion            4    .776    .024    .719    .834    2.90    .408   15.48    .000
Medium lift                 9    .763    .031    .706    .820    7.56    .477   11.84    .000
Elbow flexion               7    .762    .042    .681    .843    4.68    .586    8.71    .000
Back lift                  12    .737    .043    .660    .814   15.59    .157    7.84    .000
Knee extension             11    .723    .040    .651    .796    9.48    .487    8.10    .000
Arm dynamometer             5    .723    .128    .449    .997    2.83    .587    2.52    .006
Leg lift                   11    .716    .030    .661    .771   13.59    .193   10.38    .000
Arm lift                   13    .692    .024    .649    .736   13.09    .362   12.05    .000
Trunk flexion              11    .690    .032    .632    .747   13.34    .205    9.16    .000
Arm pull                   13    .684    .026    .637    .730   11.74    .467   10.87    .000
Ankle plantarflexion        7    .675    .050    .577    .772    6.08    .414    5.47    .000
Trunk extension            16    .667    .020    .631    .703   15.16    .440   13.10    .000
Elbow extension             4    .666    .053    .541    .791    3.68    .298    4.99    .000
Handgrip                   35    .652    .021    .616    .688   33.68    .483   11.94    .000
Knee flexion                6    .648    .063    .521    .775    3.91    .562    3.93    .000
Hip extension              10    .623    .064    .506    .740    6.20    .719    3.50    .000
Neck extension              3    .599    .053    .444    .754    1.94    .379    3.75    .000
Ankle dorsiflexion          5    .556    .054    .440    .671    4.40    .355    2.88    .002
Neck flexion                3    .492    .099    .203    .780    2.57    .276     .93    .177

Note. k is the number of samples that provided results for the test. The table includes all isometric
strength tests for which k ≥ 3. λAvg is the weighted average factor loading from the random effects
analysis. CI is confidence interval. Q is a measure of dispersion of the random effects estimates.
This statistic has an asymptotic χ² distribution with k - 1 df. z is a test of the hypothesis that
λAvg > .40. SE = standard error.
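
The z and Sig columns can be illustrated with a short calculation. The sketch below assumes that
z = (average loading - .40) / SE and that Sig is the one-sided upper-tail normal probability. The
note does not state the formula, so both are assumptions, and the function name `loading_z_test` is
hypothetical. Because the tabled loadings and standard errors are rounded, the results agree with
the table only approximately.

```python
import math

def loading_z_test(avg_loading, std_error, criterion=0.40):
    """Return (z, one-sided p) for the hypothesis that the average loading exceeds criterion.

    Assumed form (not stated in the report): z = (loading - criterion) / SE,
    p = P(Z >= z) under a standard normal distribution.
    """
    z = (avg_loading - criterion) / std_error
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail standard normal probability
    return z, p

# Neck flexion row of Table 3: average loading .492, SE .099
z, p = loading_z_test(0.492, 0.099)
print(f"z = {z:.2f}, one-sided p = {p:.3f}")
```

For comparison, the tabled entries for neck flexion are z = .93 and Sig = .177.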
Table 4

Isotonic Strength Test Results

                                                95% CI Bounds
Test                        k    λAvg      SE   Lower   Upper       Q     Sig       z     Sig
Bench press                 8    .856    .038    .785    .928    7.08    .421   12.09    .000
Shoulder press              7    .851    .025    .802    .900    2.76    .838   17.83    .000
Lat pull-down/trapezius     7    .797    .028    .743    .852    5.01    .542   14.20    .000
Arm curl                    8    .750    .036    .682    .818    7.98    .334    9.79    .000
Knee ext                    4    .607    .056    .475    .740    3.40    .334    3.67    .000
Leg extension               9    .603    .032    .543    .663    8.78    .362    6.27    .000

Note. k is the number of samples that provided results for the test. The table includes all isotonic
strength tests for which k ≥ 3. λAvg is the weighted average factor loading from the random effects
analysis. CI is confidence interval. Q is a measure of dispersion of the random effects estimates.
This statistic has an asymptotic χ² distribution with k - 1 df. z is a test of the hypothesis that
λAvg > .40. SE = standard error.
Table 5

Dynamic Strength Test Results

                                                95% CI Bounds
Test                        k    λAvg      SE   Lower   Upper       Q     Sig       z     Sig
ILM high                    7    .928    .021    .887    .969    3.45    .751   25.02    .000
ILM low                     7    .856    .047    .766    .946    7.52    .275    9.80    .000

Note. ILM high = incremental lift machine lift to 180 cm; ILM low = incremental lift machine lift to
152 cm. k is the number of samples that provided results for the test. The table includes all dynamic
strength tests for which k ≥ 3. λAvg is the weighted average factor loading from the random effects
analysis. CI is confidence interval. Q is a measure of dispersion of the random effects estimates.
This statistic has an asymptotic χ² distribution with k - 1 df. z is a test of the hypothesis that
λAvg > .40.
Table 6

Correlations of Modality-Specific Strength Factors

Factor           1                2                3
1. Isotonic      1.000
2. Isometric      .815 (k = 6)    1.000
3. Dynamic        .848 (k = 3)     .758 (k = 7)    1.000

Note. Table entries are the pooled correlations between the general ability factors. The k values are
the number of samples that provided estimates of each correlation. Isotonic = isotonic strength.
Table 7

Muscular Endurance Test Results

                                                95% CI Bounds
Test                        k    λAvg      SE   Lower   Upper       Q     Sig       z     Sig
Dips                        7    .761    .051    .662    .861    5.16    .523    7.06    .000
Push-up                    20    .753    .038    .687    .818   18.40    .496    9.32    .000
Pull-up                    30    .720    .030    .669    .770   25.28    .664   10.73    .000
Bent-arm hang              11    .699    .045    .617    .781    8.29    .601    6.63    .000
Endurance                  14    .549    .067    .430    .667   14.21    .359    2.23    .013
Sit-up                     27    .498    .023    .459    .538   21.57    .712    4.25    .000
Squat                      10    .474    .038    .404    .544    7.53    .582    1.95    .026
Leg lift/raise              8    .421    .057    .313    .529    7.42    .387     .37    .356
Half-hold sit-up            6    .363    .030    .302    .424    4.38    .496   -1.22    .888

Note. k is the number of samples that provided results for the test. The table includes all muscular
endurance tests for which k ≥ 3. λAvg is the weighted average factor loading from the random effects
analysis. CI is confidence interval. Q is a measure of dispersion of the random effects estimates.
This statistic has an asymptotic χ² distribution with k - 1 df. z is a test of the hypothesis that
λAvg > .40. SE = standard error.
Table 8

Muscular Power Test Results

                                                95% CI Bounds
Test                        k    λAvg      SE   Lower   Upper       Q     Sig       z     Sig
100-yd dash                 4    .812    .070    .648    .976     .87    .833    5.92    .000
300-yd run                  4    .786    .077    .605    .966    3.24    .356    5.04    .000
10-yd dash                  2    .782    .055    .434   1.130     .74    .389    6.94    .000
50-yd dash^a               22    .764    .037    .700    .828   24.06    .290    9.84    .000
Shuttle run                 8    .746    .060    .633    .860   13.62    .058    5.76    .000
Long jump                  30    .734    .029    .685    .783   29.94    .417   11.62    .000
600-yd run                  7    .705    .050    .608    .801    7.76    .256    6.14    .000
Vertical jump              25    .672    .026    .628    .717   21.09    .633   10.47    .000
Medicine ball/shot put     10    .664    .072    .531    .797   10.97    .278    3.64    .000
40-yd dash                  5    .653    .151    .330    .975    4.53    .339    1.67    .048
Leg ergometer               8    .609    .068    .480    .737    6.09    .530    3.07    .001
Arm ergometer               6    .559    .053    .453    .666    4.86    .433    3.02    .001

Note. k is the number of samples that provided results for the test. The table includes all muscular
power tests for which k ≥ 3. λAvg is the weighted average factor loading from the random effects
analysis. CI is confidence interval. Q is a measure of dispersion of the random effects estimates.
This statistic has an asymptotic χ² distribution with k - 1 df. z is a test of the hypothesis that
λAvg > .40. SE = standard error.

^a Includes one 60-yd dash.
Table 9

Cardiovascular Endurance Test Results

                                                95% CI Bounds
Test                        k    λAvg      SE   Lower   Upper       Q     Sig       z     Sig
2-mi run                    6    .908    .063    .781   1.034     .61    .987    8.10    .000
1-mi run                   10    .891    .047    .804    .978    8.98    .439   10.36    .000
880-yd run                  4    .889    .044    .785    .993     .67    .881   11.03    .000
3-mi run^a                  4    .886    .092    .670   1.102     .68    .877    5.30    .000
1.5-mi run                  5    .880    .051    .772    .988    3.62    .460    9.50    .000
12-min run                 11    .821    .038    .752    .891    8.54    .576   10.95    .000
1- to 1.2-km run            5    .792    .063    .658    .926    2.41    .660    6.24    .000
VO2max^b                   20    .707    .063    .598    .817   11.74    .896    4.85    .000
Step test                   5    .362    .044    .268    .457    4.04    .401    -.85    .801

Note. Runs >10 km have been omitted from the table because it is unlikely that those distances would
be considered for inclusion in fitness tests. k is the number of samples that provided results for
the test. The table includes all cardiovascular endurance tests for which k ≥ 3. λAvg is the weighted
average factor loading from the random effects analysis. CI is confidence interval. Q is a measure of
dispersion of the random effects estimates. This statistic has an asymptotic χ² distribution with
k - 1 df. z is a test of the hypothesis that λAvg > .40. SE = standard error.

^a Includes one 5-km run.

^b Laboratory measurement of maximal oxygen uptake.
Table 10

Intercorrelations of Ability Factors

Ability Factor    Isometric MS      MP                ME                CE
Isometric MS      1.000
MP                 .572 (k = 11)    1.000
ME                 .448 (k = 18)     .687 (k = 24)    1.000
CE                 .088 (k = 4)      .504 (k = 13)     .595 (k = 11)    1.000

Note. Table entries are the pooled correlations between the general ability factors. The k values are
the number of samples that provided estimates of each correlation. Isometric MS = isometric muscular
strength; MP = muscular power; ME = muscular endurance; CE = cardiovascular endurance.
APPENDIX A

Muscle Strength Measurement Model

Should muscular strength be represented as a single general construct, or is it more
appropriate to represent muscular strength as a set of correlated dimensions representing
different measurement methods? To answer this question, a unidimensional strength model was
compared with a multidimensional model in 10 data sets that included ≥2 strength tests for ≥2
measurement methods. All of the strength tests loaded on a single factor in the unidimensional
model. Tests loaded on isometric, isotonic, or dynamic strength factors, as appropriate for the
test, in the multidimensional models. The Singh et al. (1991) model included only the right side
measurement for each bilateral exercise.

The multidimensional goodness of fit was significantly better in 9 of the 10 analyses
(Table A), and the cumulative improvement in fit was statistically significant (χ² = 394.54, 16 df,
p < .001). The consistent trend was more important than the significance in any given sample or
the cumulative significance. The root mean square error of approximation, non-normed fit index,
and standardized root mean residual, all of which are widely used goodness-of-fit indices,
indicated modest gains in absolute fit.
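
The cumulative test can be reproduced from the model comparison rows of Table A. The sketch below
sums the ten chi-square differences and their degrees of freedom, then evaluates the upper-tail
chi-square probability with a closed form that holds only for even df (16 here). The helper name
`chi2_sf_even_df` is hypothetical, and adding independent chi-square differences together with their
df is the standard additivity property rather than a procedure the report spells out.

```python
import math

# (chi-square difference, df difference) from the ten "Model comparison" rows of Table A
comparisons = [
    (1.85, 3), (32.03, 3), (77.21, 3), (10.53, 1), (51.95, 1),
    (74.32, 1), (89.82, 1), (6.12, 1), (30.46, 1), (20.25, 1),
]

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum (x/2)^i / i! for i = 0 .. df/2 - 1, building each term from the previous one
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

total_chi2 = sum(chi2 for chi2, _ in comparisons)
total_df = sum(df for _, df in comparisons)
p_value = chi2_sf_even_df(total_chi2, total_df)

print(f"cumulative chi-square = {total_chi2:.2f}, df = {total_df}, p < .001: {p_value < 0.001}")
```

The summed values correspond to the 394.54 and 16 df reported in the text.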
Table A

Comparison of Unidimensional With Multidimensional Strength Models

Model                   # Dim       χ²    Sig   RMSEA    NNFI      Δχ²   df   ΔNNFI
Beckett & Hodgdon
  Women                     1    56.85   .000    .130    .794
                            3    55.00   .001    .139    .752
    Model comparison                                              1.85    3   -.042
  Men                       1   112.24   .000    .187    .817
                            3    80.21   .000    .155    .861
    Model comparison                                             32.03    3    .044
Marcinik studies
  Orlando                   1   190.76   .000    .146    .817
                            3   113.55   .000    .106    .892
    Model comparison                                             77.21    3    .075
  Shipboard                 1    73.99   .000    .176    .837
                            2    63.46   .000    .160    .847
    Model comparison                                             10.53    1    .010
  Sparten                   1   102.25   .000    .150    .864
                            2    51.70   .003    .089    .943
    Model comparison                                             51.95    1    .089
Myers et al.
  Men                       1    76.34   .000    .273    .852
                            2     2.02   .156    .045    .996
    Model comparison                                             74.32    1    .144
  Women                     1    90.83   .000    .298    .744
                            2      .01   .912    .000    1.01
    Model comparison                                             89.82    1    .257
Singh et al.                1   503.28   .000    .200    .504
                            2   497.16   .000    .200    .501
    Model comparison                                              6.12    1   -.003
Teves et al.
  Men                       1    34.79   .000    .262    .772
                            2     4.33   .363    .024    .997
    Model comparison                                             30.46    1    .225
  Women                     1    21.08   .001    .167    .840
                            2      .83   .934    .000    1.04
    Model comparison                                             20.25    1    .164

Note. Arbuckle and Wothke (1999, pp. 395-416) provide definitions of the root mean square error
of approximation (RMSEA) and the non-normed fit index (NNFI). The Δ columns indicate the
improvement in fit obtained by moving from the unidimensional model to the multidimensional
model.
APPENDIX B

Evaluation of Meta-Analysis Parameter Weights

Accurate  standard  errors  were  essential  for  the  planned  meta-analyses.  Accuracy  was 
critical  because  these  error  estimates  provided  the  basis  for  weighting  the  loadings  when 
computing  the  pooled  factor  loadings  (Borenstein,  Hedges,  Higgins,  &  Rothstein,  2009).  Correct 
weighting  was  essential  for  valid  analytical  results. 
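
The role of the standard errors as weights can be sketched as follows. The example uses the
DerSimonian-Laird moment estimator, one common random effects method; the report does not say which
estimator was applied, so this choice is an assumption. The function name `pool_random_effects` and
the loadings and standard errors passed to it are hypothetical values for illustration only.

```python
import math

def pool_random_effects(loadings, std_errors):
    """Pool factor loadings with inverse-variance weights (DerSimonian-Laird sketch).

    Returns (pooled loading, SE of pooled loading, Q, between-sample variance tau^2).
    """
    k = len(loadings)
    w = [1.0 / se ** 2 for se in std_errors]           # fixed-effect weights
    sum_w = sum(w)
    fixed = sum(wi * x for wi, x in zip(w, loadings)) / sum_w
    q = sum(wi * (x - fixed) ** 2 for wi, x in zip(w, loadings))  # dispersion statistic
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]        # random-effects weights
    sum_w_star = sum(w_star)
    pooled = sum(wi * x for wi, x in zip(w_star, loadings)) / sum_w_star
    return pooled, math.sqrt(1.0 / sum_w_star), q, tau2

# Hypothetical loadings and standard errors from four samples
pooled, se, q, tau2 = pool_random_effects([0.70, 0.62, 0.75, 0.66], [0.05, 0.04, 0.06, 0.05])
print(f"pooled loading = {pooled:.3f}, SE = {se:.3f}, Q = {q:.2f}, tau^2 = {tau2:.4f}")
```

Samples with smaller standard errors receive larger weights, which is why inaccurate standard
errors would distort the pooled loading.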

Analyzing correlation matrices can distort standard error estimates (Jöreskog, Sörbom, du
Toit, & du Toit, 2000, Appendix C, pp. 209-214). Cudeck (1989) demonstrated two minimum
requirements  for  obtaining  accurate  standard  errors  when  correlation  matrices  are  analyzed.  All 
model  parameters  must  be  freely  estimated  and  the  reproduced  correlation  matrix  must  have  ones 
on  the  diagonal.  Every  CFA  model  in  this  study  satisfied  the  first  requirement.  The  second 
condition  held  except  in  the  data  from  Falls,  Ismail,  and  MacLeod  (1966),  so  that  sample  was 
dropped  from  the  meta-analysis. 

CFA  models  based  on  covariance  matrices  provided  additional  reason  to  accept  the 
standard  errors  derived  from  analyses  of  correlation  matrices.  Standard  deviations  for  the  fitness 
test  scores  had  been  reported  in  some  studies.  Covariance  matrices  could  be  constructed  for  those 
studies  by  combining  the  standard  deviations  with  the  correlation  matrices. 
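
Combining standard deviations with a correlation matrix amounts to scaling each correlation by the
two standard deviations involved: cov(i, j) = r(i, j) x SD(i) x SD(j). A minimal sketch, using
hypothetical values and a hypothetical helper name `covariance_from_correlation`:

```python
def covariance_from_correlation(corr, sds):
    """Build a covariance matrix from a correlation matrix and standard deviations."""
    n = len(sds)
    if len(corr) != n or any(len(row) != n for row in corr):
        raise ValueError("corr must be an n x n matrix matching len(sds)")
    return [[corr[i][j] * sds[i] * sds[j] for j in range(n)] for i in range(n)]

# Hypothetical two-test example: correlation .50, standard deviations 2.0 and 3.0
corr = [[1.0, 0.5],
        [0.5, 1.0]]
sds = [2.0, 3.0]
cov = covariance_from_correlation(corr, sds)
print(cov)
```

The diagonal of the result holds the variances (the squared standard deviations), and the
off-diagonal entries are the covariances.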

Covariance-based CFA models were evaluated for three studies. Two findings from those
analyses supported the accuracy of the correlation models. First, the completely standardized
factor loadings were identical to the corresponding loadings from the correlation analyses.
Second, the t values for the corresponding factor loadings were identical in the two analyses.
These results suggested that the CFA models were invariant under the transformation from
covariance matrices to correlation matrices. Invariance under transformation is a third condition
that is associated with accurate standard error estimations in correlation matrix analyses
(Jöreskog et al., 2000).
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